
[Auto Parallel] fix bugs for split_batches_for_accumulation && fix bugs for enable_delay_scale_loss #9217

Merged

Conversation

@zhangyuqin1998 (Contributor) commented Sep 29, 2024

PR types

Bug fixes

PR changes

Others

Description

A. Fix the case where, under dynamic-graph auto parallel, split_batches_for_accumulation does not line up with the manual (hand-written) parallel implementation, as shown in the attached figure. A generic sketch of such splitting follows below.
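The sketch below is illustrative only: it assumes a plain accumulation split in which every input tensor is cut along the batch dimension into acc_steps contiguous micro batches. The actual split_batches_for_accumulation in the auto trainer may handle inputs differently.

```python
import paddle

def split_batches_for_accumulation(inputs: dict, acc_steps: int):
    """Split every tensor in `inputs` along dim 0 into `acc_steps` micro batches."""
    micro_batches = [dict() for _ in range(acc_steps)]
    for key, tensor in inputs.items():
        chunks = paddle.split(tensor, num_or_sections=acc_steps, axis=0)
        for i, chunk in enumerate(chunks):
            micro_batches[i][key] = chunk
    return micro_batches

# A global batch of 8 samples with acc_steps=4 yields 4 micro batches of 2 samples.
batch = {"input_ids": paddle.randint(0, 100, shape=[8, 16])}
print([mb["input_ids"].shape for mb in split_batches_for_accumulation(batch, 4)])
```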

B. Fix incorrect enable_delay_scale_loss logic under dynamic-graph auto parallel. Auto parallel implements enable_delay_scale_loss by default, and the expected behavior is:

  1. Compute the loss for each micro batch
  2. Run the backward pass
  3. Sum the losses within the mini batch
  4. Scale the summed loss by dividing it by the number of accumulation steps

However, the current dynamic-graph auto-parallel behavior is:

  1. Compute the loss for each micro batch
  2. Scale that loss by dividing it by the number of accumulation steps
  3. Run the backward pass
  4. Sum the losses within the mini batch

In addition, because the enable_delay_scale_loss logic is still incomplete, this PR adds a switch for it in auto parallel. A sketch contrasting the two behaviors follows below.
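As a rough illustration of the difference (not the actual auto_trainer code; model, optimizer, micro_batches and acc_steps are hypothetical placeholders):

```python
import paddle

def accumulation_step_expected(model, optimizer, micro_batches, acc_steps):
    """Expected enable_delay_scale_loss behavior: backward per micro batch,
    then sum the micro-batch losses and scale the sum once by 1/acc_steps."""
    losses = []
    for batch in micro_batches:
        loss = model(batch)              # 1. loss per micro batch
        loss.backward()                  # 2. backward per micro batch
        losses.append(loss.detach())
    reported_loss = paddle.add_n(losses) / acc_steps  # 3. sum, 4. scale once
    # Gradients still need the 1/acc_steps factor applied before the update
    # (e.g. when gradients are synchronized or just before optimizer.step()).
    optimizer.step()
    optimizer.clear_grad()
    return reported_loss

def accumulation_step_before_fix(model, optimizer, micro_batches, acc_steps):
    """Behavior this PR fixes: each micro-batch loss was scaled by 1/acc_steps
    before backward, and the scaled losses were summed afterwards."""
    losses = []
    for batch in micro_batches:
        loss = model(batch) / acc_steps  # scaled too early
        loss.backward()
        losses.append(loss.detach())
    reported_loss = paddle.add_n(losses)
    optimizer.step()
    optimizer.clear_grad()
    return reported_loss
```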

C. Fix how the loss is printed/displayed under static-graph auto parallel. The expected display behavior is:

  1. Sum the losses of all micro batches
  2. Scale the sum by dividing it by the number of accumulation steps

However, the current static-graph auto-parallel display behavior is (compare the sketch after this list):

  1. Scale each micro batch's loss by dividing it by the number of accumulation steps
  2. Sum the scaled losses
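A minimal illustration of the two orders of operations (values and names are made up, not taken from the trainer code); the fix makes the displayed loss sum first and scale once, matching the expected behavior described above:

```python
import paddle

micro_losses = [paddle.to_tensor(2.0), paddle.to_tensor(4.0), paddle.to_tensor(6.0)]
acc_steps = len(micro_losses)

# Expected display: sum the micro-batch losses first, then scale the sum once.
displayed_loss = paddle.add_n(micro_losses) / acc_steps

# Display before the fix: scale each micro-batch loss first, then sum.
old_displayed_loss = paddle.add_n([l / acc_steps for l in micro_losses])

print(float(displayed_loss), float(old_displayed_loss))
```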

D. Fix a misspelled function name: traning is corrected to training.


paddle-bot bot commented Sep 29, 2024

Thanks for your contribution!


codecov bot commented Sep 29, 2024

Codecov Report

Attention: Patch coverage is 2.77778% with 35 lines in your changes missing coverage. Please review.

Project coverage is 52.74%. Comparing base (76a118b) to head (ad0488c).
Report is 4 commits behind head on develop.

Files with missing lines | Patch % | Lines
paddlenlp/trainer/auto_trainer.py | 0.00% | 34 Missing ⚠️
paddlenlp/trainer/trainer.py | 50.00% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9217      +/-   ##
===========================================
- Coverage    53.11%   52.74%   -0.38%     
===========================================
  Files          665      661       -4     
  Lines       109041   107375    -1666     
===========================================
- Hits         57918    56634    -1284     
+ Misses       51123    50741     -382     


From00 previously approved these changes Sep 29, 2024

@From00 (Collaborator) left a comment:

LGTM

@zhangyuqin1998 force-pushed the fix_split_batches_for_accumulation branch from 225b3a7 to 513086c on October 8, 2024 08:29
From00 previously approved these changes Oct 12, 2024

@liym27 (Contributor) left a comment:

LGTM

@zhangyuqin1998 force-pushed the fix_split_batches_for_accumulation branch from cadd4d4 to f286f90 on October 18, 2024 01:23
@zhiqiu (Collaborator) left a comment:

LGTM

@ZHUI (Collaborator) left a comment:

LGTM

@wawltor merged commit a564483 into PaddlePaddle:develop on Oct 23, 2024
10 of 12 checks passed
lvdongyi pushed a commit to lvdongyi/PaddleNLP that referenced this pull request on Oct 23, 2024 (PaddlePaddle#9217):

* [Auto Parallel] fix bugs for split_batches_for_accumulation && fix bugs for enable_delay_scale_loss

* add enable_delay_scale_loss flag for auto_parallel

* fix ut

* Update ci_case_auto.sh